Explore Prosper Loan Data by Jack Rong

This dataset comes from Prosper Marketplace, a peer-to-peer lender. It contains 113,937 rows of loan level data with 81 characteristics. I would like to investigate a few meaningful features such as LoanStatus, ProsperRating..numeric., ListingCategory, and DebtToIncomeRatio as well as some supporting variables.

## [1] 113937     81

After reviewing the variable definitions document, I’ve decided to focus on the subset of the prosper data by including only loans originated from August 2009 and onward. Prosper relaunched its lending platform in July 2009 after a SEC investigation and updated its own credit rating system. Also, the consumer credit industry has changed drastically since the 2008 Financial Crisis. These two pieces of information warrant overlooking earlier data. The remaining dataset has 84,964 rows.

## [1] 84964    82

Univariate Plots

It is clear that the most popular term is 36 months, or 3 years, while there are much fewer 12-month loans. I’d like to explore this variable further by breaking it down by loan status in the bivariate section.

##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"
LoanStatus DelinquencyStatus
Cancelled Performing
Current Performing
Completed Performing
FinalPaymentInProgress Performing
Past Due (1-15 days) Performing
Past Due (16-30 days) Performing
Past Due (31-60 days) Delinquent
Past Due (61-90 days) Delinquent
Past Due (91-120 days) Delinquent
Past Due (>120 days) Delinquent
Defaulted Delinquent
Chargedoff Delinquent

In the consumer credit world, a loan is considered deliquent if it is over 30 days past due. To simplify my analysis in later sections, I’ve mapped Prosper’s 12 loan statuses to my own classifier, loan deliquency status. I’ve created a new column for this variable.

About 10% of Prosper loans are delinquent.

A Prosper rating of 7 represents the highest quality and a rating of 1 the lowest. This rating system was implemented in 2009 after Prosper relaunched its website. I’d like to investigate whether the distribution of ratings changes over time between 2009 and 2014 in later sections.

Prosper score is a proprietary risk score calculated using historical Prosper data. It ranges from 1 to 10 with 1 being assigned to the the riskiest borrowers and 10 to the least risky borrowers.

The sudden spike at around 32% is worth looking into later in the analysis. That aside, borrower rates appear skewed to the right with a long tail. From a business perspective, this makes sense because interest rates cannot be negative in this case. Who wouldn’t want to borrow at negative rates?

After applying log transformation to BorrowerRate, I have a more promising plot that looks more like a normal distribution. In lending, each loan is not created equal. A loan with a higher dollar amount should be given a greater weight. I suspect that if I plot loan amounts against log-transformed borrower rates, I will get a nice normal distribution.

As shown by this barplot of relative frequency, the percentage difference between borrowers with home ownership and those without is not significant. I will explore and compare deliquency trends for these two greoup.

There are significant fewers borrowers who have over 100 months of employment history. This probably has something to do with Prosper’s lending strategy of targeting tech-savvy millenials.

In lending, credit scores are bucketed at increments of 20, so I adjusted the bin width here to 20 accordingly. It is sufficient to look solely at the lower range of credit scores. There are no borrowers with below 600 FICOs and very few with below 640 credit scores. On the other hand, the number of borrowers with great credit profiles is also limited. The majority of borrowers tend to have credits scores between 680 and 720.

This plot is not particularly informative on its own. It will be interesting to look at what borrowers utilizing similar levels of available bank credit share in common.

It makes sense that fewers loans should remain outstanding as the number of months since origination increases. With most 1-year loans getting paid off just a few months after origination, I would expect the number of outstanding loans to pick back up again because 3- and 5-year loans tend to pay off at a slower pace. I will break this plot into two, one for performing and another for delinquent loans in the next section.

The most frequent loan amounts are 4000, 10000, and 15000.

Growth in loan origination has been strong since 2009. 2014 data is incomplete with less than one quarter of origination included. I would assume 2014 was on track to become yet another record-breaking year for Prosper at this rate. This type of steady growth gives me more reason to dig into loan quality. To keep things simple, I will focus on four origination cohorts by year from 2010 to 2013 later in my analysis.

Univariate Analysis

What is the structure of your dataset?

There are 84,964 loans in the filtered dataset with 81 features. I added two more to assist in my analysis. I will take a closer look at 15 of them. Prosper rating has two versions, numeric and alphabetical, both of which are ordered factor variables, and it consists of the following seven levels.

(lower quality)———————>(higher quality)

1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA

## 'data.frame':    84964 obs. of  15 variables:
##  $ Term                      : int  36 36 36 60 36 36 36 36 60 36 ...
##  $ BorrowerRate              : num  0.092 0.0974 0.2085 0.1314 0.2712 ...
##  $ ProsperRating..numeric.   : int  6 6 3 5 2 4 7 7 4 5 ...
##  $ ProsperScore              : num  7 9 4 10 2 4 9 11 7 4 ...
##  $ ListingCategory..numeric. : int  2 16 2 1 1 2 7 7 1 1 ...
##  $ EmploymentStatus          : Factor w/ 9 levels "","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration  : int  44 113 44 82 172 103 269 269 300 1 ...
##  $ IsBorrowerHomeowner       : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
##  $ CreditScoreRangeLower     : int  680 800 680 740 680 700 820 820 640 680 ...
##  $ BankcardUtilization       : num  0.21 0.04 0.81 0.39 0.72 0.13 0.11 0.11 0.51 0.7 ...
##  $ DebtToIncomeRatio         : num  0.18 0.15 0.26 0.36 0.27 0.24 0.25 0.25 0.12 0.18 ...
##  $ LoanMonthsSinceOrigination: int  0 16 6 3 11 10 3 3 22 2 ...
##  $ LoanOriginalAmount        : num  0.01 0.01 0.015 0.015 0.003 0.01 0.01 0.01 0.0135 0.004 ...
##  $ LoanOriginationYear       : chr  "2014" "2012" "2013" "2013" ...
##  $ DelinquencyStatus         : chr  "Performing" "Performing" "Performing" "Performing" ...
##       Term        BorrowerRate    ProsperRating..numeric.  ProsperScore  
##  Min.   :12.00   Min.   :0.0400   Min.   :1.000           Min.   : 1.00  
##  1st Qu.:36.00   1st Qu.:0.1359   1st Qu.:3.000           1st Qu.: 4.00  
##  Median :36.00   Median :0.1875   Median :4.000           Median : 6.00  
##  Mean   :42.48   Mean   :0.1960   Mean   :4.072           Mean   : 5.95  
##  3rd Qu.:60.00   3rd Qu.:0.2574   3rd Qu.:5.000           3rd Qu.: 8.00  
##  Max.   :60.00   Max.   :0.3600   Max.   :7.000           Max.   :11.00  
##                                   NA's   :127             NA's   :127    
##  ListingCategory..numeric.      EmploymentStatus EmploymentStatusDuration
##  Min.   : 0.000            Employed     :67322   Min.   :  0             
##  1st Qu.: 1.000            Full-time    : 8012   1st Qu.: 30             
##  Median : 1.000            Self-employed: 4542   Median : 74             
##  Mean   : 3.312            Other        : 3806   Mean   :103             
##  3rd Qu.: 3.000            Not employed :  649   3rd Qu.:147             
##  Max.   :20.000            Retired      :  370   Max.   :755             
##                            (Other)      :  263   NA's   :19              
##  IsBorrowerHomeowner CreditScoreRangeLower BankcardUtilization
##  False:40061         Min.   :600.0         Min.   :0.0000     
##  True :44903         1st Qu.:660.0         1st Qu.:0.3300     
##                      Median :700.0         Median :0.6000     
##                      Mean   :699.4         Mean   :0.5641     
##                      3rd Qu.:720.0         3rd Qu.:0.8300     
##                      Max.   :880.0         Max.   :2.5000     
##                                                               
##  DebtToIncomeRatio LoanMonthsSinceOrigination LoanOriginalAmount
##  Min.   : 0.000    Min.   : 0.00              Min.   :0.001000  
##  1st Qu.: 0.150    1st Qu.: 4.00              1st Qu.:0.004000  
##  Median : 0.220    Median :11.00              Median :0.007500  
##  Mean   : 0.259    Mean   :16.06              Mean   :0.009077  
##  3rd Qu.: 0.320    3rd Qu.:25.00              3rd Qu.:0.013500  
##  Max.   :10.010    Max.   :55.00              Max.   :0.035000  
##  NA's   :7306                                                   
##  LoanOriginationYear DelinquencyStatus 
##  Length:84964        Length:84964      
##  Class :character    Class :character  
##  Mode  :character    Mode  :character  
##                                        
##                                        
##                                        
## 

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are BorrowerRate and LoanStatus. I’m interested in determining how to predict delinquency and principal recovery if a loan becomes delinquent. The former is well within the scope of this analysis while the latter is likely something I will continue exploring as a side personal project because it involves advanced statistical modeling.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The dataset has a wealth of borrower information for each loan. I’d like to narrow the long list down to the following features, which could shed some insights into delinquency.

  • ListingCategory
  • EmploymentStatusDuration
  • IsBorrowerHomeowner
  • CreditScoreRangeLower
  • BankCardUtilization
  • DebtToIncomeRatio
  • LoanMonthsSinceOrigination
  • LoanOriginationYear

Prosper obviously devotes its resources to developing its own credit rating and risk scoring system. I’d also like to validate the predictive power of these two proprietary measures.

Did you create any new variables from existing variables in the dataset?

I created two new variables from existing variables. I derived a boolean variable, DelinquencyStatus, from LoanStatus. In this case, a loan is considered delinquent if it is 30 or more days past due. The creation of LoanOriginationYear was straightforward. I simply formatted LoanOriginationDate.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I log-transformed the right-skewed, long-tailed BorrowerRate distribution. I applied this transformation because in finance, it is known that interest rates have a lower bound at zero and follow a log-normal distribution. One simple explanation is that borrowers would get paid to take out loan at negative rates. Although the log-transformed distribution still does not look normal as expected, I believe a bivariate plot of loan amount and log-transformed rate is what I’m looking for.

Bivariate Plots

Most performing loans are paid off before the end of their terms (12, 36, and 60 months) while the number of delinquent loans peaks at 20 months after origination. It is important to note that according to this plot, loans are less likely to become delinquent once they reach the month 20 milestone.

This confirms my previous univariate plot that on the Prosper platform, the percentage of 3-year loans is the highest and that of 5-year loans is the second highest.

This plot with loan amount on the y-axis confirms the exponential growth that Prosper has experienced since 2009.

Debt consolidation is the most popular type of loans among Prosper’s borrowers.

Note here that numeric values of ListingCategory is mapped to a list of word descriptions. The top five listing categories by loan amount are listed below with both numeric indexes and descriptions.

The longer the term, the higher the loan amount. I have a hunch that this has something to do with the loan purpose, or the listing category in this dateset.

The above plot reveals that borrowers pay higher rates with longer terms. From a risk perspective, longer terms are associated with greater uncertainties, so investor would demand more returns to compensate for more risks.

If we overlook the vertical cluster at 0% usage of available revolving credit and the horizontal cluster at 30+% borrower rate across all revolving credit utilization ratios, we can notice a weak direct relationship between bank card utlization and borrower rate.

There isn’t a notable trend in this plot. For borrower rates below 20%, greater debt to income (DTI) ratios are weakly correlated to higher borrower rates; however, this does not hold for borrower rates above 20% and especially at 30+% where borrower rates scatter across a range of DTI ratios.

Delinquency tends to happen between month 15 and and month 31. This confirms our prior observation from our frequency polygon of LoanMonthsSinceOrigination in the univariate section. Because of an increase in delinquent loans, a greater number of loans is outstanding roughly 20 months after origination.

Zooming in and only looking at borrowers whose employment durations are within 5 years, I discover that there is a good concentration of delinquent borrowers who have held their current employment status for less than 2 years. On the one hand, these borrowers are unlikely to have access to credit from traditional lenders. On the other hand, they have riskier profiles and are more likely to become delinquent. This is a great business opportunity for alternative lenders to dominate an underserved market segment.

I’d like to find out how Prosper’s risk rating system compared to the standard credit score metric. In general, higher Prosper ratings correspond to better credit scores. The only exception is for loans with the lowest Prosper rating. I will investigate which credit risk measure is more effective in identifying delinquency in the multivariate section.

As suspected in the previous section, the plot of log-transformed borrower rate and loan amount looks less skewed than the original one without the transformation. There is a a good number of loans concentrated in the lower right where loan amounts are low and borrower rates are high. This explains why my univariate graph reveals a large number of loans with 30+% rates.

The higher the FICO or Prosper score, the better quality the loan, the lower the rate. To determine which metric is more effective, I will need to conduct a statistical analysis.

It makes great sense that ProsperRating..numeric. and ProsperScore are highly correlated. Also, I can conclude that BorrowerRate and ProsperRating..numeric. are the pair with the strongest relationship with a correlation coefficient of -1.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I’ve discovered that loans tend to become delinquent roughly 20 months after origination and that borrowers with shorter durations of current employment status are more likely to miss their payments or default.

In terms of borrower rates, I’ve observed several interesting relationships. One of them is the term structure of interest rates, or the way rates change with loan terms. Prosper offers 3 different terms: 1-, 3-, and 5-year. It is worth noting that borrowers taking on 1-year loans, the least popular by loan count, have the lowest rates while 3-year loans, the most popular by loan count, have the highest rates. From an economic perspective, the longer the loan terms, the greater the uncertainties around future payments and the risks of delinquency. If borrower rates are an efficient reflection of the credit worthiness of borrowers, only borrowers with generally better credit profiles can take out 5-year loans. Therefore, they are likely to receive lower rates despite longer loan terms.

Another one is a weak correlation between bank card utlization and borrower rate. This is logical because if borrowers have already used some of their revolving credit lines available to them from their banking institutions, they will have to pay higher rates to acccess more credit.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

It is very clear that the traditional FICO credit score and the Prosper score share much in common in measuring the credit risk profile of borrowers and their loans. These two features, CreditScoreRangeLower and ProsperRating..numeric., have one of the highest correlation coefficients among all the 15 variables.

It is quite a surprising discovery that there is not apparent relationship between debt to income (DTI) ratio and borrower rate given the importance of DTI in the consumer credit industry. DTI is the main measure for borrower’s ability to repay in major financial regulations. However, it does not seem to be a determinant for interest rates for Prosper borrowers.

What was the strongest relationship you found?

My most interesting discovery is that BorrowerRate and ProsperRating..numeric. have a correlation coefficient of -1 and exhibit the strongest relationship among all the variables in my analysis. Based on this finding, I can infer that the borrower rate is set by Prosper according to the loan’s Prrosper rating.

Multivariate Plots

I’ve compared the traditional FICO credit score and the proprietary Prosper score in the previous section. These two plots above allow me to look at the two credit metrics and determine which one is more effective.

Traditionally, debt to income (DTI) ratio is used to predict the likelihood of loan delinquency and repayment. However, it is worth noting that at higher borrower rates, even loans with low DTI ratios become delinquent . One explanation is that DTI is a static measure taking into account the income at the time of loan application only. Borrowers paying 30% interest rates still have reasonable incentives not to continue making payments if they lose their jobs for example.

This graph does not reveal any meaningful relationships. It is expected that borrowers using the Prosper platform would have high revolving credit usage with their banking institutions. Since delinquent loans are well-spread-out, BankCardUtilization does not seem to be a useful variable for future analysis.

A few things jump out right away here. First, Prosper does a good job at limitiing delinquencies for loans of larger amounts. Second, loans with the most popular listing categories perform quite well in aggregate. Third, the following listing categories seem to have a higher percentage of delinquent loans.

The two plots above are similar. The latter comes from adding a facet of numeric Prosper rating (the higher the rating the better the loan quality) to the former.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It is also interesting to find out that loans in certain listing categories are more like to turn delinquent than others. I suspect that this has much to do with behavioral psychology for these types of borrowers and that alternative sources of data could help predict delinquencies in this situation. I believe Prosper already employs such technique to compute its Prosper score for each loan because it is visually evident that delinquency ratio is lower for loans with higher Prosper scores than those with higer FICO credit scores.

Looking at Prosper’s origination volume by year, I find it interesting that Prosper’s rapid growth is driven mostly by loans with numeric Prosper ratings between 4 and 6. Loans with other ratings have only experienced either modest or inconsistent growth over the same period. From a strategic perspective, it is difficult to find borrowers with pristine credit profiles because they usually do not overleverage their financials. On the other hand, originating less than prime loans leads to excessive risks taken on by the platform and its investors.

Were there any interesting or surprising interactions between features?

The most surprising finding is that two variables, debt to income ratio and bankcard utilization, have no effect on a loan’s rate or its delinquency status. Not only do they have low correlation coefficients, 0.1 and 0.2, respective, but they also do not seem to generate clear trends when it comes to predicting delinquency.


Final Plots and Summary

Plot One

Description One

This plot reveals the distribution of borrower rates by loan term and separates data to two groups, delinquent and performing loans. 3-year loans are the most common on the Prosper platform. It is also true that borrower rates are higher for longer terms in general. There is an exception where some 3-year loans have rates around 35%, which are the highest rates in this dataset. I suppose that as a way to keep aggregate risks on the platform under control, Prosper does not originate 5-year loans of certain qualities that would demand borrower rates exceeding 35%.

This is one of the simplest plots to get a sense of Prosper’s risk management framework. Economically speaking, investors demand higher returns for taking on greater risks. And if borrower rates are an efficient reflection of the credit worthiness of borrowers, only borrowers with generally better credit profiles can take out 5-year loans. Therefore, they are likely to receive lower rates despite longer loan terms. Any factors contributing to the risk profile of a loan should be correlated to the loan’s borrower rate because the difference between investor return and borrower rate is the servicing charge earned by the origination platform, Prosper.

Plot Two

Description Two

Comparing the above two plots, I find that delinquencies are prevalent among loans with higher rates for both credit metrics. This is especially true for borrowers with low credit scores and low Prosper scores. However, it is visually evident that the Prosper score system is more effective in ensuring that fewer delinquencies occur for loans at lower rates and borrowers with better credit profiles than the traditional FICO credit score regime.

Plot Three

Description Three

Here are some interesting observation. First, volume growth is mainly driven by loans with middle ratings. Second, there is no significant increase in the ratio of delinquent loans between the 2011 and 2012 cohorts (2013 cohort is not yet mature and is excluded here for comparison). Last, delinquency ratios seem to rise for lower ratings, validating Prosper proprietary rating system.


Reflection

The original dataset from Prosper is quite enormous with 113937 rows and 81 variables. I do have some experience in consumer lending, so I’ve spent less time familiarizing myself with the background information on the loan data and more time exploring interesting trends in the dataset. The real struggle comes down to organizing the dataset and understanding it well enough so that I can select a representative subset of variables and discover useful insights.

From a business standpoint, borrower rates should efficiently reflect borrowers’ credit worthiness. That means what contributes to the credit profile of borrowers has a lot to do with what rates they have to pay. In addition, I’m interesting in investigating what causes loan delinquencies.

Having completed this project, I believe there is a lot more to find out from the original dataset. I’ve only explored 15 variables out of 81. I’d be curious to compare estimated losses against actual principal losses and discover whether behaviors of delinquent borrowers vary by state.

Instead of mainly looking at simple averages as is the case in this analysis, I also hope to investigate whether weighted averages by loan amount change my conclusions and provide further insights into Prosper loan data. Again, each loan is not created equal. One with a larger loan size means a greater amount of risks for Prosper and its investors. It should be weighted more when calculating averages.

Ultimately, I believe alternative data could further enhance delinquency analysis. It would be helpful if this dataset included some demographic information of borrowers, for example, age, zip code, education, etc. In addition, metadata on user behaviors when applying for Prosper loans on the website can and will shed practical insights into borrower psychologies and help Prosper understand their borrowers from a behavioral perspective. This could help improve its predictive model for delinquencies and defaults.